kvserver: use EncodedError in SnapshotResponse #75248
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/replica_command.go, line 2512 at r2 (raw file):
	return err
}
return nil
maybe just:
return r.store.cfg.Transport.SendSnapshot(
ctx,
r.store.allocator.storePool,
req,
snap,
newBatchFn,
sent,
)
instead of everything? or maybe you like the existing pattern :)
Suggestion:
if err := r.store.cfg.Transport.SendSnapshot(
ctx,
r.store.allocator.storePool,
req,
snap,
newBatchFn,
sent,
); err != nil {
return err
}
return nil
TFTR, Lidor!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel)
pkg/kv/kvserver/replica_command.go, line 2512 at r2 (raw file):
Previously, lidorcarmel (Lidor Carmel) wrote…
maybe just:
return r.store.cfg.Transport.SendSnapshot(
	ctx,
	r.store.allocator.storePool,
	req,
	snap,
	newBatchFn,
	sent,
)
instead of everything? or maybe you like the existing pattern :)
Done.
Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel)
77272: kvserver: skip some slow tests under race r=erikgrinaker a=tbg
These take >1m under race and are now skipped for the race build. Note that TestLearnerSnapshotFailsRollback takes 90s even under ideal conditions, though that is fixed once I get #75248 over the finish line. At that point it probably wouldn't figure prominently as a slow test, but until then it's verrrry slow (>7m, probably since race stuff slows down the longer it runs) so still good to skip it now.
https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1646221664032729
Release justification: testing-only change
Release note: None
Co-authored-by: Tobias Grieger <[email protected]>
e01eb26 to 8380321
I'd noticed this recently while working on #77246 but couldn't see an easy way to work around it. TIL about errors.EncodedError.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel)
@tbg should we merge this?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel)
I had put it off a bit to avoid changing
bors r=aayushshah15
Build failed (retrying...):
bors r-
Looks like I need to shake out some flakes.
Canceled.
8bf769d to 61d8135
Rebased. Tests seem to pass reliably (at least kvserver package), but this doesn't seem good:
Going to take another look at slow tests before merging this. I bet lots of those aren't new:
Ironically, TestLearnerSnapshotFailsRollback still takes 45s for each of the two cases 😆 Well, I'll go through these tests.
09da176 to 8310d5b
Dismissed @lidorcarmel from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @lidorcarmel)
PTAL @aayushshah15
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @lidorcarmel)
We were previously using a "Message" string to indicate details about an error. We can do so much better now and actually encode the error. This wasn't possible when this field was first added, but it is now, so let's use it. As always, there's a migration concern, which means the old field stays around & is populated as well as interpreted for one release.

We then use this new-found freedom to improve which errors are marked as "failed snapshot" errors. Previously, any error coming in on a `SnapshotResponse` was considered a snapshot error and therefore retriable. This was causing `TestLearnerSnapshotFailsRollback` to run for 90s, as `TestCluster`'s replication changes use a [SucceedsSoon] to retry snapshot errors, but that test actually injects an error that it wants to fail fast on. Now, since snapshot error marks propagate over the wire, we can do the marking on the *sender* of the SnapshotResponse, and we can mark only messages that correspond to an actual failure to apply the snapshot (as opposed to an injected error, or a hard error due to a malformed request). The test now takes around one second, for a rare 90x speed-up.

As a drive-by, we're also removing `errMalformedSnapshot`, which became unused when we stopped sending the raft log in raft snaps a few releases back, and which had managed to hide from the `unused` lint.

[SucceedsSoon]: https://github.com/cockroachdb/cockroach/blob/37175f77bf374d1bcb76bc39a65149788be06134/pkg/testutils/testcluster/testcluster.go#L628-L631

Fixes cockroachdb#74621.

Release note: None
bors r=aayushshah15
This PR was included in a batch that timed out, it will be automatically retried
Build failed (retrying...):
Build succeeded:
We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.
We then use this new-found freedom to improve which errors are marked
as "failed snapshot" errors. Previously, any error coming in on a
SnapshotResponse was considered a snapshot error and therefore
retriable. This was causing TestLearnerSnapshotFailsRollback to run
for 90s, as TestCluster's replication changes use a SucceedsSoon to
retry snapshot errors, but that test actually injects an error that it
wants to fail fast on. Now, since snapshot error marks propagate over the
wire, we can do the marking on the sender of the SnapshotResponse, and
we can mark only messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.
As a drive-by, we're also removing errMalformedSnapshot, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the unused lint.
Fixes #74621.
Fixes #87337.
Release note: None